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Abstract. In statistics, series of ordinary least squares problems (OLS) are used to study 
the linear correlation among sets of variables of interest; in many studies, the number of such 
variables is at least in the millions, and the corresponding datasets occupy terabytes of disk 
space. As the availability of large-scale datasets increases regularly, so does the challenge in 
dealing with them. Indeed, traditional solvers—which rely on the use of “black-box” routines 
optimized for one single OLS—are highly inefficient and fail to provide a viable solution for big- 
data analyses. As a case study, in this paper we consider a linear regression consisting of two- 
dimensional grids of related OLS problems that arise in the context of genome-wide association 
analyses, and give a careful walkthrough for the development of OLS-GRID, a high-performance 
routine for shared-memory architectures; analogous steps are relevant for tailoring OLS solvers 
to other applications. In particular, we first illustrate the design of efficient algorithms that 
exploit the structure of the OLS problems and eliminate redundant computations; then, we 
show how to effectively deal with datasets that do not ht in main memory; finally, we discuss 
how to cast the computation in terms of efficient kernels and how to achieve scalability. 
Importantly, each design decision along the way is justified by simple performance models. 
OLS-GRID enables the solution of 10^^ correlated OLS problems operating on terabytes of data 
in a matter of hours. 

Keywords: Linear regression, ordinary least squares, grids of problems, genome wide associ¬ 
ation analysis, algorithm design, out-of-core, parallelism, scalability 


1 Introduction 

Linear regression is an extremely common statistical tool for modeling the relationship between two 
sets of data. Specifically, given a set of “independent variables” xi, 0 : 2 , ■ ■ •, Xp, and a “dependent 
variable” y, one seeks the correlation terms f3i,i = 1,... ,p in the linear model 

y = Pixi + .../3pXp. (1) 

In matrix form, Eq. Q is expressed as y = Xj3 + e, where y G i?" is a vector of n “observations”, 
the columns of X G are “predictors” or “covariates”, the vector /? = [Pi,..., /3p]'^ contains the 

“regression coefficients”, and e is an error term that one wishes to minimize. In many disciplines, 
linear regression is used to quantify the relationship between one or more y’s from the set y, and 
each of many x^s from the set X. The computational challenges raise from the all-to-all nature of 
the problem (estimate how strongly each of the covariates is related to each of the observations), 
and from the sheer size of the datasets y and X, which often cannot be stored directly in main 
memory. 


One of the standard approaches to fit the model Q to given y and X is by solving an ordinary 
least squares (OLS) problem; in linear algebra terms, this corresponds to computing the vector /3 
such that 

In typical datasets, m, the number of available covariates (rh = \X\), is much larger than p, the 
number of variables actually used in the model. In this case, a group oi I < p covariates is kept 
fixed, and the remaining p — I slots are filled from X, in a rotating fashion; it is not uncommon 
that the value p — I is very small, often just one, thus originating m > m distinct OLS problems. 
Mathematically, this means computing a series of f3i’s such that 

Pi = ^i) ^ y> where f = 1,...,m; (2) 

here Xi consists of two parts: X]^, which contains I columns and is fixed across all m OLS problems, 
and Xr., which instead contains p — I columns taken from X. In many applications, m can be of 
the order of millions or even more. 

When t > 1 dependent variables {t = |3^|) are to be studied against X, the problem ([^ assumes 
the more general form 

A, = (3) 

where i = 1,..., m, and j = 1,..., t, indicating that one has to compute a two-dimensional grid of 
Aj’s, each one corresponding to an OLS problem. This is for instance the case in genomics (multi¬ 
trait genome-wide association analyses) [1] and econometrics (explanatory variable exploration) [2] . 

Despite the fact that OLS solvers are provided by many libraries and languages (e.g., LAPACK, 
NAG, MKL, Matlab, R), no matter how optimized those are, any approach that aims at computing 
the 2D grid ([^ via t x m invocations of a “black-box” routine is entirely unfeasible. The main 
limitations come from the fact that this approach leads to the execution of inefficient and redundant 
operations, lacks a mechanism to effectively manage data transfers from and to hard disk, and 
underutilizes the resources on parallel architectures. 

In this paper, we consider an instance of Eq. ([^ as it arises in genomics, and develop OLS-GRID, 
a parallel solver tailored for this application. Specifically, we focus on the study of omics datcj^in the 
context of genome wide association analyses (GWAA) Omics GWAA study the relation between 
m groups of genetic markers and t phenotypic traits in populations of n individuals. In terms of 
OLS, each trait is represented by a vector pj containing the trait measurements (one per individual); 
each matrix Xi = [Xl\Xr.] is composed of a set of I fixed covariates such as sex, age, and height 
(Xl), and one of the groups oi r — p — I markers {Xb..). A positive correlation between markers 
Xfl. and trait pj indicates that the markers may have an impact in the expression of the trait. 

Typical problem sizes in omics GWAA are roughly 10^ < n < 10®, 2 < p < 20 (with r = 1 
or 2), 10® < m < 10®, and 10^ < t < 10®. An exemplary analysis with size n = 30,000, p = 10 
{I = 8, r = 2), m = 10^, and t = 10"*, poses three considerable challenges. First, it requires the 
computation of 10^^ OLS problems, which, if tackled by a traditional “black-box” solver, would 
perform 0(10^®) floating point operations (flops). Despite the fact that the problem lends itself to 
a lower computational cost and efficient solutions, a black-box solver ignores the structure of the 
problem and requires large clusters to obtain a solution. The second challenge is posed by the size of 
the datasets to be processed: assuming single-precision data (4 bytes per element), a GWAA solver 
reads as input about 2.4 TBs of data and produces as output 4 TBs of data. If the data movement 
is not analyzed properly, the time spent in I/O transfers might render the computation unfeasible. 

^ With the term omics we refer to large-scale analyses involving at least hundreds of traits mm. 

^ GWAA are often also referred to as genome wide association studies (GWAS) and whole genome associ¬ 
ation studies (WGAS). 
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Finally, the computation needs to be parallelized and organized so that the potential of the current 
multi-core and many-core architectures is fully exploited. 

Contributions. This paper is concerned with the design and the implementation of OLS-GRID, a 
high-performance algorithm for large-scale linear regression. While we use omics GWAA as a case 
study, the discussion is relevant to a range of OLS-based applications. Specifically, we 1) illustrate 
how to take advantage of the specific structure in the grid of OLS problems to design specialized 
algorithms, 2) analyze the data transfers from and to disk to effectively deal with large datasets 
that do not fit in main memory, and 3) discuss how to cast the computation in terms of efficient 
kernels and how to attain scalability on many-core architectures. Moreover, by making use of simple 
performance models, we identify the performance bottlenecks with respect to the problem size. 
OLS-GRID, available as part of the GenABEL suite [S], allows one to execute an analysis of the 
aforementioned size in less than 7 hours on a 40-core node. 

Related work. Genome-wide association analyses received a lot of attention in the last decade [7]. 
Numerous high-impact findings have been reported, including but not limited to the identification 
of genetic variations associated to a common form of blindness, type 2 diabetes, and Parkinson’s 
disease mnm- a popular approach to GWAA is the so called Variance Components model, 
which boils down to a set of equations similar to Eq. ([^ . The main difference with the present work 
lies on the core equation, where one has to solve grids of generalized least squares (GLS) problems 
instead of grids of OLSs. A number of libraries have been developed to carry out GLS-based GWAA, 
the most relevant being FaST-LMM, GEMMA, GWEGLS, and OmicABEL |6I12I13I5] . 

OmicABEL, developed within our research group, showed remarkable performance improvements 
with respect to the other existing methods mm- Clearly, the same library can be used, by set¬ 
ting the covariance matrix to the identity, to solve Eq. ([^. While possible, this is not advisable: 
OmicABEL reduces the two-dimensional grid of GLS problems to the grid of OLS problems 

k, = 

which is deceivingly similar to Eq. @ ; however, the subtle differences in the dependencies (subindices) 
of the design matrix X lead to a more expensive and less efficient algorithm. In Sec. we show how 
the new OLS-GRID outperforms OmicABEL-Eig considerably. 

A number of tools are focused on OLS-based linear regression analyses for GWAA; among them, 
we mention ProbABEL (also part of the GenABEL suite), GWASP, and BOSS |16I17I18] . GWASP 
stands out for its elegant algorithmic approach and a performance-oriented design; a more detailed 
discussion is given in Sec. |5.4| 

Organization of the paper. The remainder of the paper is organized as follows. In Sec. we describe 
the design of an efficient algorithm that exploits problem-specific knowledge. In Sec. we analyze 
the data transfers and discuss possible limitations inherent to the problem at hand, while in Sec. 
we focus on the high performance and scalability of our software. Section presents performance 
results. Einally, Sec. draws conclusions and discusses future work. 

2 Prom the problem specification to an efficient algorithm 

In this section we discuss the steps leading to the core algorithm behind OLS-GRID. Starting from 
an elementary and generic algorithm, we incrementally refine it into an efficient algorithm tailored 
specifically for grids of OLS arising in GWAA studies. The focus is on the reduction of the compu¬ 
tational complexity. 
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As a starting point to solve Eq. ^ , we consider a most naive black-box solver consisting of two 
nested loops (for each i and j) around a QR-based algorithm for OLS [Ti^ : 

{Q,R} -qriX) 
b := R~^Q^y. 


Since the design matrix X only depends on the index i, the loop ordering j makes it possible to 
compute the QR factorization once and reuse it across the j loop. Even with this simple code motion 
optimization, this first algorithm performs substantial redundant calculations due to the structure 
of X,. 

Recall that each matrix X^ can be logically seen as consisting of two parts where 

Xl is constant, and X^. varies across different instances of Xi. Aware that the QR factorization 
of Xi has to be computed for each i, the question is whether or not such structure is exploitable. 
The answer lies in the “Partitioned Matrix Expression” (PME) of the QR factorization: for a given 
operation, the PME reveals how (portions of) the output operands can be expressed in terms of 
(portions of) the input operands pO] . 

In this specihc case, we wonder if and how the vertical partitioning of X propagates to the 
computation of Q and R. Consider 



(4) 


where Ql € R^^\Qr € R^^\Rtl € R^^\Rtr e and Rrr G ■ By rewriting Eq. Q as 

{^QlRtL = Xr I QlRtr + QrRbr = Xr^ , 
and using the orthogonality of Q {Q]]Qr — 0), one derives the assignments 


{Ql,Rtl} ■= qr{XL) 

Rtr ■= Q^Xr (5) 

{Qr, Rbr] '■= qr{XR - QlRtr), 


which indicate that the factorization oi Xr can be computed only once and re-used as Xr varies. 

In Alg.[2all the observations made so far are incorporated. In particular, the loop ordering is set 
to i, j, and code motion is applied whenever possible, to avoid redundant calculations. Each line of 
the algorithm is annotated with the corresponding BLAS or LAPACK kernel and its computational 
cost. Realizing that not only Xr, but also Qr and Rtr are constant (they only depend on X^), we 
can now deliver a more sophisticated algorithm which saves even more flops. 

As a direct effect of the partitioning of Qt in {Ql\Qr-) in line 7, the vector Zij can be decomposed 
as 

^ / Qlvj \ 


suggesting that the top part (zTj := QAj) be precomputed, once per vector yj, and then 
reused. 

Similarly, the structure of Ri can be exploited to avoid redundant computation within the trian¬ 
gular system in line 8. By using the same top-bottom splitting for Zij, and partitioning bj accordingly, 
we obtain the expression 


Rtl 

RTRi 

0 

RBRi 




4 









Algorithm 1 : Structure of X exposed and exploited 


1 

{Ql,Rtl} ’■= qr{Xx) 


(qr) 

2nl^ 

2 

for i := 1 to m do 




3 

RxRi ■= Q"lXr. 


(gemm) 

2mlnr 

4 

Ti ~ Xr. — QhRxRi 

(gemm) 

2mlnr 

5 

{QRi.RBRi} ~ qr{Ti) 

(qr) 

2mnr^ 

6 

for j := 1 to t do 




7 

Us.. 

1 Vi 

(gemv) 

2mtpn 

s 


RTRi\ 

(trsv) 

mtp^ 



I ^^3 
RbrJ 

9 

end for 





10: end for 


which can be rewritten as 

Rtl^T + RTRibs = ZTj 
RsRibB = ZBij 

Straightforward manipulation suffices to show that bx and bs can be computed as 

bBij ■= RsRi^BB 

bxij '■= RjxizTj - RTRibB,j)- ( 6 ) 

Given the dependencies with the loop indices, the top part of b can be partially precomputed by 
first moving the operation kj := to the same initial loop where zt^ '■= Q^lUj is precomputed, 

and then moving the operation hi := R^^^Rtr^ out of the innermost loop. The resulting algorithm 
is displayed in Alg. As discussed, zt^ and kj are precomputed in an initial loop (lines 2-5), and 
kj is kept in memory (a few megabytes at most) for later use within the nested double-loop (line 
14). Also, the computation of hi is taken out of the innermost loop (line 10) and reused within the 
innermost loop (line 14). The cost of Alg. [^is dominated by the term 2mtrn, that is, a factor of 
p/r fewer operations with respect to Alg. 

3 Out-of-core algorithm: Analysis of data streaming 

As mentioned in Sec. the second challenge to be addressed when dealing with series and grids of 
least squares problems is the management of large datasets. An example clarifies the situation; in 
the characteristic scenario in which n = 30,000, p = 10 (r = 2), m = 10^, and t = 10^, the input 
dataset is of size 2.4 TBs, and the computation generates 4 TBs as output. Since such datasets 
exceed the main memory capacity of current shared-memory nodes and therefore reside on disk, one 
has to resort to an out-of-core algorithm m- The main idea behind our design is to effectively tile 
(block) the computation to amortize the time spent in moving data. 

To emphasize the need for tiling, we commence by discussing the I/O requirements of a naive 
out-of-core implementation of Alg.[^ as sketched in Alg.[^ First, the algorithm requires the loading 
of the entire set of vectors yj (line 2) for the computations in the initial loop. Then, it loads once 
the entire set of matrices Xfii (line 6), loads m times the entire set of vectors yj (line 9), and finally 
requires the storage of the resulting m x t vectors bij (line 11). The reading of yj for a total of m 
times (line 9) is the clear I/O bottleneck. For the aforementioned example, the algorithm generates 
12 petabytes of disk-to-memory traffic, which at a (rather optimistic) transfer rate of 2 GBytes/sec, 
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Algorithm 2 : Structure of Q and R exposed and exploited 

1 

{Ql,Rtl} '.= qr{XL) 

(qr) 

2nl^ 

2 

for j := 1 to t do 



3 

■= Qlvj 

(gemv) 

2tln 

4 

kj := Rj,j^ZTj 

(trsv) 


5 

end for 



6 

for i := 1 to m do 



7 

Rtr^ '.= Q"lXr^ 

(gemm) 

2mlnr 

8 

Ti := Xr^ — QLRvRi 

(gemm) 

2mlnr 

9 

{Qfli, Asfl. } ~ qr{Ti) 

(qr) 

2mnr^ 

10 

hi 

(trsm) 

mPr 

11 

for j := 1 to t do 



12 

~ Qr^Vj 

(gemv) 

2mtrn 

13 

bSij RsRiZ'Bij 

(trsv) 

mtr^ 

14 

bTij ■= kj — hibB^j 

(gemv) 

2mtlr 

15 

end for 



16 

end for 



would take 70 days of I/O. It is thus imperative to reorganize I/O and computation to greatly reduce 

the amount of generated I/O traffic. 



Algorithm 3 : Naive out-of-core 

approach 

1 

Load matrix Xr 

4ln — >■ 

< 1 MBs 

2 

for j := 1 to t do 



3 

Load vector yj 

4tn — >■ 

1.2 GBs 

4 

Compute with yj 



5 

end for 



6 

for i := 1 to m do 



7 

Load matrix Ajj. 

Amur — >■ 2.4 TBs 

8 

Compute with A_r. 



9 

for j := 1 to t do 



10 

Load vector yj 

4mtn —> 12 PBs 

11 

Compute with Xiii,yj 



12 

Store vector j3ij 

4mtp 

4 TBs 

13 

end for 



14 

end for 




3.1 Analysis of computational cost over data movement 

A tiled algorithm decomposes the computation of the 2D grid of problems into subgrids or tiles. 
Instead of loading one single matrix and vector yj^ and storing one single vector bij, the idea 
is to load and store multiple of them in a slab. The tiled algorithm presented in Alg. loads slabs 
of tb vectors yj (Y) and mt matrices Xj^. (A/j) in lines 4, 11, and 19; and stores slabs of computed 
mb X tb vectors bij (B) in line 27. 

With tiling, the amount of I/O required by line 19 of Alg. [^is constrained to 

m 

t X n X —, 
mb 


6 












Algorithm 4 : Tiled algorithm: Data loaded and stored in slabs 

1 

Load Xl 



2 

{Ql,Rtl} := qr{XL) 

(qr) 

2nf 

3 

for j := 1 to t/tb do 



4 

Load slab Yj 



5 

for 1 := 1 to tb do 



6 

Zt,, ■■= QlYj, 

(gemv) 

2tln 

7 

^31 ^TlZTj^ 

(trsv) 

tp 

8 

end for 



9 

end for 



10 

for i := 1 to mlrrib do 



11 

Load slab 



12 

for k := 1 to mb do 



13 

■= QJ,Xr-^ 

(gemm) 

2mlnr 

14 

fi^ ■= Xr,^ - QhRTRi^ 

(gemm) 

2mlnr 

15 

{QRi^ , Rbr^^ } := ) 

(qr) 

2mnr^ 

16 

Hik ■= ^TL^TRii^ 

(trsm) 

mPr 

17 

end for 



18 

for j 1 to t/tb do 



19 

Load slab Yj 



20 

for k := 1 to mb do 



21 

for 1 := 1 to tb do 



22 


(gemv) 

2mlrn 

23 

Ss,,,, := Rrr,^ 

(trsv) 

mtr^ 

24 

Bt := Kjj — Hi.Bb 

(gemv) 

2milr 

25 

end for 



26 

end for 



27 

Store slab Bij 



28 

end for 



29 

end for 
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and, most importantly, can be adjusted by setting the parameter mt,. We note that, in terms of 
I/O overhead, there is no need to set to the largest possible value allowed by the available main 
memory: it suffices to choose rub large enough so that the I/O bottleneck shifts to the loading of 
Xfi and writing of B: 

m 

t X n X — <S^mxnxr + mxtxp. 
mb 

After choosing a sufficiently large value for mb, the ratio of computation over data movement is 

0{mtrn) _ 0(trn) 

0{mnr + mtp) 0{nr + tp) 

Therefore, beyond the freedom in parameterizing mb, it is the actual instance of the equation, that 
is, the problem sizes, which determines whether the computation is compute-bound or lO-bound. 
We illustrate the different scenarios (memory-bound vs lO-bound) in Sec. where we present 
experimental results. 

3.2 Overlapping I/O with computation: double buffering 

For problems that are not largely dominated by I/O, it is possible to reduce or even completely elim¬ 
inate the overhead due to I/O operations by overlapping I/O with computation. In the development 
of OLS-GRID, we adopted the well-known double buffering mechanism to asynchronously load and 
store data: the main memory is logically split into two buffers; while operations within an iteration 
of the innermost loop are performed on the slabs previously loaded in one of the buffers, a dedicated 
I/O thread operating on the other buffer downloads the results of the previous iteration and uploads 
the slabs for the next one. The experiments in Sec. conhrm that this mechanism mitigates the 
negative effect of data transfers and, for compute-bound scenarios, it prevents I/O from limiting the 
scalability of our solver. 

4 High performance and scalability 

In the previous sections, we designed an algorithm that avoids redundant computations, and studied 
the impact of data transfers to constrain (or even eliminate) the overhead due to I/O. One last issue 
remains to be addressed: how to exploit shared-memory parallelism. In this section, we illustrate 
how to reorganize the computation so that it can be cast in terms of efficient BLAS-3 operations, 
and discuss how to combine different types of parallelism to attain scalability. 

The computational bottleneck of Alg. is the operation Q^.yj in line 22, which is executed 
m X t times. This is confirmed visually by Fig. [l] which presents, for n = 1,000 and varying m 
and t, the weight of that operation as a percent of the total computation (flops) performed by the 
algorithm. Already for small values of m and t (in the hundreds), the operation accounts for 90% of 
the computation. When m and t take more realistic values (in the thousands or more), the percent 
attains values close to 100%. 

The operation Q^ Uj is implemented by the inefficient BLAS-2 GEMV kernel, which on a single¬ 
core only attains about 10% of the peak performance, and on multi-cores suffers from poor scalability. 
The idea to overcome this bottleneck is to combine the mt, matrices into a block , and the tb 
vectors yj into a block Yj. This transformation effectively recasts the small matrix-vector products 
(gemv’s) into the large and much more efficient matrix-matrix operation Qj^ Yj (gemm), which 
achieves both close-to-optimal performance and high scalability. 

While the use of a multi-threaded version of the BLAS library for the gemm operation Cf^ Yj 
enables high scalability, it is a poor choice for other sections of the algorithm, which do not scale 





Fig. 1. Percentage of the computation performed by the operation QJi.yj from the total operations required 
by Algorithm]^ 


as nicely. In fact, when using a large number of cores (as is the case in our experiments), the 
weight of these other sections increases to the point that it affects the overall scalability. As a 
solution to mitigate this problem we turn to a hybrid parallelism combining multi-threaded BLAS 
with OpenMP. Lines 13, 14, and 16 in Alg. are examples of operations that do not scale with 
a mere use of a multi-threaded BLAS. Let us illustrate the issue with the matrix product in line 
13. One of these GEMM’s in isolation multiplies Q"£ € with Xj^. € that is, it multiplies 

a wide matrix with only a few rows times a thin matrix with only a few columns. Even when mt, 
Xfl.’s are combined into the block Xu., one of the matrices (Ql) is still rather small. In terms of 
number of operations per element, the ratio is very low, which results in an inefficient and poorly 
scalable operation. In this case, using a multi-threaded BLAS is not sufficient (in our experiments we 
observed a performance of about 5 GF/s, i.e., below 1% efficiency). Instead, we decide to explicitly 
split the operation among the compute threads by means of OpenMP directives. Each thread takes 
a proportional number of columns from Xr. and computes the corresponding GEMM using a single- 
threaded BLAS call. This second alternative results in a speedup of 5x to lOx, sufficient to mitigate 
the impact in the overall performance and scalability. 


The resulting algorithm is displayed in Alg. Among the computational bottlenecks, we high¬ 
light in green the operations parallelized by means of a multi-threaded version of BLAS (line 18) 
and in blue the operations parallelized using OpenMP (lines 10-12). As we show in the next sec¬ 
tion, the recasting of small operations into larger ones and the use of a hybrid form of parallelism 
that combines multi-threaded BLAS and OpenMP leads to satisfactory performance and scalability 
signatures. 
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Algorithm 5 : Efficient and scalable algorithm: OLS-GRID 

1 

Load Xl 



2 

{Ql,Rtl} ~ qr{XL) 

(qr) 

2nZ^ 

3 

for j := 1 to t/tb do 



4 

Load slab Yj 



5 

Zt, := QlY, 

(gemm) 

2tln 

6 

Kj ~ Rj,j^ZTj 

(trsm) 


7 

end for 



8 

for i := 1 to m/mb do 



9 

Load slab Xji. 



10 

R-TRi ■= QJ,Xr. 

(gemm) 

2mlnr 

11 

Ti := Xr- — QhRrRi 

(gemm) 

2mlnr 

12 

Hi := R^/RTRi 

(trsm) 

ml^r 

13 

for k := 1 to mi, do 



14 


(qr) 

2mnr^ 

15 

end for 



16 

for j 1 to t/tb do 



17 

Load slab Yj 



18 

Zb,, 

(gemm) 

2mtrn 

19 

for k := 1 to mb do 



20 


(trsm) 

mtr^ 

21 

:= Kj - 

(gemm) 

2mt\v 

22 

end for 



23 

Store slab Bij 



24 

end for 



25 

end for 
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5 Experimental results 


In this section, OLS-GRID (the implementation of Alg. is tested in a variety of settings to provide 
evidence that it makes a nearly optimal use of the available resources; it is also compared and 
contrasted with a number of available alternatives, namely a naive solver (the linear regression 
solver from ProbABEL), a specialized solver for grids of GLS (generalized least squares) problems 
(OmicABEL-Eig), and a specialized solver for grids of OLS problems (GWASP). 

All our tests were run on a system consisting of 4 Intel(R) Xeon(R) E7-4850 Westmere-EX 
multi-core processors. Each processor comprises 10 cores operating at a frequency of 2.00 GHz, 
for a combined peak performance in single precision of 640 GFlops/sec. The system is equipped 
with 256 GBs of RAM and 8 TBs of disk as secondary memory; our measurements indicate that 
the Lustre file system attains a maximum bandwidth of about 300 MBs/sec and 1.7 GBs/sec for 
writing and reading operations, respectively. The solver was compiled using Intel’s icc compiler 
(vl4.0.1), and linked to Intel’s MKL multi-threaded library (vl4.0.1), which provides for BLAS and 
LAPAGK functionality. The routine makes use of the OpenMP parallelism provided by the compiler 
through a number of pragma directives. All computations were performed in single precision, and 
the correctness of the results was assessed by direct comparison with the OLS solver from LAPAGK 
(SGELS). 

For the asynchronous I/O transfers, we initially tested the AIO library (available in all Unix 
systems). We observed that I/O operations had a considerable impact in performance by limiting 
the flops attained by GEMM. Since AIO does not offer the means to specify the amount of threads 
spawned for I/O and to pin threads to cores, we developed our own light-weight library which uses 
one single thread for I/O operations and allows thread pinning|^ 

5.1 Compute-bound vs lO-bound sceuarios 

We commence the study of our own solver by showing the practical implications of the ratio com¬ 
putation over I/O transfers 

0{trn) 

0{nr + tp) 

We collected the time spent in computations and the time spent in I/O operations for a range of 
problems varying the size of n. Figure [^presents these results. The horizontal red line corresponds to 
a ratio of 1, while the other three lines represent the ratio compute time over I/O time for different 
values of n and for an increasing number of compute threads. 

While the bandwidth is fixed, the computational power increases with the number of compute 
threads. For n = 5,000 and n = 20,000, the I/O time eventually overtakes computation time (the 
ratio becomes < 1). At that point, the problem becomes lO-bound and the use of additional compute 
threads does not reduce the time-to-solution. For n = 30,000 instead, the time spent in I/O never 
grows larger than the time spent in computation and I/O is perfectly overlapped. 

In the next experiment we show how the results in Fig. [^translate into scalability results. 


5.2 Scalability 

We study now the scalability of OLS-GRID. Figure [^presents scalability results for the same problem 
sizes discussed in the previous experiment. The red diagonal line represents perfect scalability. As 

^ The library is publicly available at http://hpac. rwth-aachen. de/~f abregat/software/lwaio-vO. 1. 
tgz 
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Number of compute threads 

Fig. 2. Ratio of compute time over I/O time. Results for a varying value of n and an increasing number 
of compute threads. The other problem sizes are fixed: Z = 4, r = 1, m = 10®, t = 10"*, mb = 10^, and 
tb = 5x 10®. 


the ratio in Fig. [^suggested, the line for n = 5,000 shows perfect scalability for up to 8 threads; from 
that point on, the I/O transfers dominate the execution time and no larger speedups are possible. 
Similarly, for n = 20,000, almost perfect scalability is attained with up to 28 compute threads. Again, 
beyond that number, the I/O dominates and the scalability plateaus. Instead, for n = 30,000, the 
I/O never dominates execution time and the solver attains speedups of around 34x when 36 compute 
threads are used. In all cases, we observe a drop in scalability when 40 compute threads are used; 
this is because one compute thread and the I/O thread share one core. We attribute the drop at 
39 compute threads for the experiment with n = 30,000—and, less apparently, at n = 5,000 and 
n = 20,000—to potential memory conflicts, since in the used architecture each pair of threads share 
the LI and L2 caches. 

An important message to extract from these results is the need to understand the characteristics 
of the problem at hand to decide which architecture hts best our needs. In the case of omics GWAA, 
the size of n plays an important role in the decision of whether investing in further computational 
power or larger bandwidth. 


5.3 Efficiency 


As a final result, we quantify how efficiently OLS-GRID uses the available resources. To this end, we 
measured the time-to-solution for a variety of scenarios using 36 compute threads and compared 
the performance with that of the practical peak performance, that is, the best performance attained 
by GEMM. The results are presented in Fig. the tested scenarios include all combinations of 
n = (5,000, 30,000), p = (5,10), m = (10®, lO”^), and t = (10®, 10^). While for small sizes of n the 
I/O transfers limit the efficiency, with a maximum of 0.3, for large values of n the attained efficiency 
ranges from 0.64 to 0.93. 
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Number of compute threads 


Fig. 3. Scalability results for a varying value of n. The other problem sizes are fixed: Z = 4, r = 1, m = 10®, 
t = 10^, mt = 10"^, and tf, = 5 x 10®. 


5.4 Comparison with alternative solvers 

We conclude the performance study with a detailed discussion on how OLS-GRID compares to other 
existing solvers. All presented timings were obtained in the same experimental setup as the previous 
sections, and making use of the available 40 cores. 

ProbABEL is a collection of tools for linear and logistic regression, as well as Cox proportional 
hazards models |16| . While in this discussion we concentrate on its dehciencies in terms of per¬ 
formance for large-scale linear regression analyses, we want to clarify that this is a versatile tool 
offering functionality well beyond the core linear regression solver. ProbABEL’s OLS solver does 
not take advantage of the structure of the problem at hand, and thus delivers poor performance. 
For instance, an analysis with sizes n = 30,000, p = 5 (Z = 4, p = 1), m = 10,000, and t = 100 takes 
almost half an hour, while OLS-GRiD completes in less than two seconds. The gap in performance 
between a traditional black-box solver and a carefully designed solver is enormous: if the previous 
analysis were extended to m = 10^ and t = 10"^, ProbABEL’s algorithm would become unusable 
(estimated completion time of more than 13 years), while our OLS-GRID completes the analysis in 
6.9 hours. 

OmicABEL-Eig is the multi-trait {t > 1) solver for GLS-based analyses in OmicABEL |5ll5j . 
The concepts behind OmicABEL-Eig are similar to those discussed in this paper; in particular, the 
algorithm relies on the reduction of GLS problems to OLS ones. However, the dependencies in the 
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Scenario (n - p - m -1) 


Fig. 4. Efficiency of our solver for a range of scenarios using 36 compute threads. GEMm’s peak performance 
is used as practical peak performance of the architecture. 


grid to be solved are different, and computationally the consequences are huge. On the one hand, 
while the asymptotic complexity of both OmicABEL-Eig and OLS-GRID is the same, the constants 
for the former are large. On the other hand, even though OmicABEL-Eig shows great scalability, 
its efficiency compared to the present solver is low because it cannot take advantage of GEMM. As 
a result, the same analysis that took OLS-GRID 6.9 hours to complete, would require a month of 
computation for OmicABEL-Eig. In short, even the use of an existing fully optimized solver for a 
similar regression analysis may not be the best choice, and it is worth the effort of designing a new 
one tailored to the needs of the specific problem at hand. 


GWASP is a recently developed solver tailored for the computation of a grid of OLS problems 
similar to that addressed in this paper m- Thanks to techniques such as the aggregation of vector- 
vector and matrix-vector operations into matrix-vector and matrix-matrix operations, respectively, 
GWASP delivers an efficient algorithm. Unfortunately, a direct comparison with OLS-GRID is not 
possible, since 1) this solver only computes the last r entries of f3ij, 2) the focus is on single-trait 
{t = 1) analyses, and 3) only R code is provided. 

In order to study GWASP’s performance, we implemented the algorithm in G using the BLAS and 
LAPAGK libraries, and ran a number of single-trait experiments. For a problem of size n = 5,000, 
p = 5 {I = 4, r = 1), m = 10®, and t = 1, GWASP took 61 seconds, while OLS-GRID completed in 14 
seconds. Similarly, if n is increased to n = 30,000, the analysis completes in 314 and 84 seconds for 
GWASP and OLS-GRID, respectively. Since the computational complexity of both solvers is similar 
and these scenarios are lO-bound, the differences mainly reside in the careful hybrid parallelization 
of OLS-GRID and the overlapping of data transfers. 

For the general case of f > 1, GWASP may only be used one trait at a time. In the last scenario 
with t = 1,000, it would require days to complete. By contrast, OLS-GRID finishes in 138 seconds, 
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thanks to the optimizations enabled when considering the 2D grid of problems as a whole. Despite 
this relatively large gap, we believe that GWASP can be extended to multi-trait analyses and to the 
computation of entire /3’s, while achieving performance comparable to that of our solver. 

6 Conclusions 

We addressed the design and implementation of efficient solvers for large-scale linear regression 
analyses. As case study, we focused on the computation of a two-dimensional grid of ordinary least 
squares problems as it appears in the context of genome-wide association analyses. The resulting 
routine, OLS-GRID, showed to be highly efficient and scalable. 

Starting from the mathematical description of the problem, we designed an incore algorithm 
that exploits the available problem-specific knowledge and structure. Next, to enable the solution of 
problems with large datasets that do not fit in today multi-core’s main memory, we transformed the 
incore algorithm into an out-of-core one, which, thanks to tiling, constrains the amount of required 
data movement. By incorporating the double-buffering technique and using an asynchronous I/O 
library, we completely eliminated I/O overhead whenever possible. 

Finally, by reorganizing the calculations, the computational bottleneck was cast in terms of the 
highly efficient GEMM routine. Combining multi-threaded BLAS and OpenMP parallelism, OLS-GRID 
also attains high performance and high scalability. More specifically, for large enough analyses, the 
solver achieves single-core efficiency beyond 90% of peak performance, and speedups of up to 34x 
with 36 cores. 

While previously existing tools allow for multiple trait analysis, these are limited to small datasets 
and require considerable runtimes. We enable the analysis of previously intractable datasets and offer 
shorter times to solution. Our OLS-GRiD solver is already integrated in the GenABEL suite, and 
adopted by a number of research groups. 

6.1 Future work 

Two main research directions remain open. On the one hand, support for distributed-memory ar¬ 
chitectures is desirable, allowing for further reduction in time-to-solution. This step will require a 
careful distribution of workload among nodes and the use of advanced I/O techniques to prevent 
data movement from becoming a bottleneck. 

On the other hand, we are already working on the support for so-called missing and erroneous 
data. Often, large datasets are collected by combining data from different sources. Therefore, subsets 
of data for certain individuals (for instance, entries in yj vectors) may not be available and labeled 
as missing values (NaNs). Erroneous data may origin, for instance, from errors in the acquisition. 
Accepting input data with NaN values (and preprocessing it accordingly) will make our software of 
broader appeal and will facilitate its adoption by the broad GWAA community. 
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